Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
Abstract
We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.
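As a rough illustration of what a character-level representation of a bitext can look like, the sketch below rewrites a sentence as a sequence of characters (or overlapping character n-grams) with an explicit word-boundary marker, so that a standard phrase-based pipeline can treat each character as a token. The underscore marker, the function names, and the choice of bigrams are assumptions made for this example; the abstract does not specify the exact preprocessing.

```python
from typing import List

# Assumed word-boundary marker; the abstract does not say which symbol is used.
WORD_SEP = "_"


def to_char_level(sentence: str) -> List[str]:
    """Split a sentence into single characters, marking word boundaries."""
    words = sentence.split()
    return list(WORD_SEP.join(words))


def to_char_ngrams(sentence: str, n: int = 2) -> List[str]:
    """Return overlapping character n-grams over the boundary-marked sentence."""
    chars = to_char_level(sentence)
    if len(chars) < n:
        # Too short for a full n-gram: return what is there as a single unit.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def from_char_level(tokens: List[str]) -> str:
    """Invert to_char_level: join the characters and restore spaces.

    Assumes the original text did not itself contain WORD_SEP.
    """
    return "".join(tokens).replace(WORD_SEP, " ")
```

Applying `to_char_level` to both sides of a parallel corpus yields sequences that a word-level toolkit can align and translate unchanged; `from_char_level` maps the character-level output back to words before scoring with word-level BLEU.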
Similar resources
Character-based PSMT for Closely Related Languages
Translating unknown words between related languages using a character-based statistical machine translation model can be beneficial. In this paper, we describe a simple method to combine character-based models with standard word-based models to increase the coverage of a phrase-based SMT system. Using this approach, we can show a modest improvement when translating between Norwegian and Swedish...
A Character Level Based and Word Level Based Approach for Chinese-Vietnamese Machine Translation
Chinese and Vietnamese are both isolating languages in which words are not delimited by spaces. In machine translation, word segmentation is often done first when translating from Chinese or Vietnamese into other languages (typically English) and vice versa. However, whether words should be segmented is open to question when translating between two languages in whi...
Hybrid Word-Character Neural Machine Translation for Modern Standard Arabic
Traditional neural machine translation architectures use a word-level approach that assumes all important words have enumerable and relatively frequent surface forms. This assumption is invalid for the large number of non-analytic languages that form new words on the basis of complex morphological processes. Contributing to the quest of finding a universalist architecture that performs well for...
Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries
In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis – Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese, where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the ...
A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level ...